Method and system to mark an audio signal with metadata

ABSTRACT

A method of processing an audio signal comprises receiving an audio signal, extracting features from the audio signal, and translating the extracted features into metadata. The metadata comprises an instruction set of a markup language. A system for processing the audio signal is also disclosed, which comprises an input device for receiving the audio signal and a processor for extracting the features from the audio signal and for translating the extracted features into the metadata.

The present invention relates to a method and system for processing an audio signal in accordance with extracted features of the audio signal. The present invention has particular, but not exclusive, application with systems that determine and extract musical features of an audio signal, such as tempo and key. The extracted features are translated into metadata.

Ambient environment systems that control the environment are known from, for example, our United States patent application publication U.S. 2002/0169817, which discloses a real-world representation system that comprises a set of devices, each device being arranged to provide one or more real-world parameters, for example audio and visual characteristics. At least one of the devices is arranged to receive a real-world description in the form of an instruction set of a markup language, and the devices are operated according to the description. General terms expressed in the language are interpreted by either a local server or a distributed browser to operate the devices to render the real-world experience to the user.

United States patent application publication U.S. 2002/0169012 discloses a method of operating a set of devices that comprises receiving a signal, for example at least part of a game world model from a computer program. The signal is analysed to produce a real-world description in the form of an instruction set of a markup language, and the set of devices is operated according to the description.

It is desirable to provide a method of automatically generating instruction sets of the markup language from an audio signal.

According to a first aspect of the present invention there is provided a method of processing an audio signal comprising receiving an audio signal, extracting features from the audio signal, and translating the extracted features into metadata, the metadata comprising an instruction set of a markup language.

According to a second aspect of the present invention there is provided a system for processing an audio signal, comprising an input device for receiving an audio signal and a processor for extracting features from the audio signal and for translating the extracted features into metadata, the metadata comprising an instruction set of a markup language.
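
As a minimal sketch of these two aspects, the following Python fragment outlines the receive-extract-translate pipeline. The function names, the placeholder feature values and the tempo threshold are assumptions made for illustration only; they are not the algorithms of the references cited below.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Features:
    tempo_bpm: float     # e.g. 120.0
    key: str             # e.g. "A minor"
    mean_volume: float   # normalised 0..1

def extract_features(samples: List[float], sample_rate: int) -> Features:
    # Placeholder: a real extractor would analyse the signal as in the
    # references cited in the description below.
    rms = (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5
    return Features(tempo_bpm=120.0, key="A minor", mean_volume=min(rms, 1.0))

def translate(features: Features) -> List[str]:
    # Placeholder translation into broad markup language terms.
    return ["<SUMMER>"] if features.tempo_bpm > 100 else ["<AUTUMN>"]

def process_audio(samples: List[float], sample_rate: int) -> List[str]:
    # Receive -> extract -> translate: the method of the first aspect.
    return translate(extract_features(samples, sample_rate))

print(process_audio([0.0, 0.5, -0.5], 44100))  # ['<SUMMER>']
```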

Owing to the invention, it is possible to generate automatically, from an audio signal, metadata that is based upon the content of the audio signal and that can be used to control an ambient environment system.

The method advantageously further comprises storing the metadata. This allows the user the option of reusing the metadata that has been outputted, for example by transmitting it to a location that does not have the processing power to execute the feature extraction from the audio signal. Preferably, the storing comprises storing the metadata with associated time data, the time data defining the start time and the duration, relative to the received audio signal, of each markup language term in the instruction set. By storing, with the metadata, time data that is synchronised to the original audio signal, the metadata, when reused with the audio signal, defines an experience that is time dependent but that also matches the original audio signal.
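
One plausible layout for the stored metadata pairs each markup language term with its start time and duration in seconds, matching the example of FIG. 3. The record fields and the JSON serialisation below are hypothetical; the patent does not prescribe a storage format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TimedTerm:
    term: str          # markup language term, e.g. "<SUMMER>"
    start_s: float     # start time relative to the audio signal, in seconds
    duration_s: float  # duration of the term, in seconds

def store_metadata(terms: list, path: str) -> None:
    # Persist the instruction set with its time data so it can be reused
    # later on a device lacking the processing power for feature extraction.
    with open(path, "w") as f:
        json.dump([asdict(t) for t in terms], f, indent=2)

store_metadata([TimedTerm("<SUMMER>", 0.0, 120.0),
                TimedTerm("<AUTUMN>", 120.0, 90.0)], "metadata.json")
```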

Advantageously, the method further comprises transmitting the instruction set to a browser, and further comprises receiving markup language assets. Preferably, the method also further comprises rendering the markup language assets in synchronisation with the received audio signal. In this way, the metadata is used directly for providing the ambient environment. The browser receives the instruction set and the markup language assets and renders the assets in synchronisation with the outputted audio, as directed by the instruction set.

The features extracted from the audio signal, in a preferred embodiment, include one or more of tempo, key and volume. These features define, in a broad sense, aspects of the audio signal. They indicate such things as mood, which can then be used to define metadata that will determine the ambient environment that augments the audio signal.
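
By way of illustration only, two such features, mean volume and a rough tempo, can be estimated with elementary signal processing. The sketch below assumes a mono floating-point signal several seconds long and is far simpler than the extractors in the references cited later:

```python
import numpy as np

def mean_volume(samples: np.ndarray) -> float:
    # Root-mean-square level as a crude "volume" feature.
    return float(np.sqrt(np.mean(samples ** 2)))

def naive_tempo_bpm(samples: np.ndarray, sr: int) -> float:
    # Autocorrelate a short-time energy envelope and pick the strongest
    # lag in a plausible beat range (40-200 BPM).
    hop = sr // 100                                    # ~10 ms frames
    frames = samples[: len(samples) // hop * hop].reshape(-1, hop)
    envelope = (frames ** 2).mean(axis=1)
    envelope = envelope - envelope.mean()
    ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
    frame_rate = sr / hop                              # envelope frames/second
    lo = int(frame_rate * 60 / 200)                    # lag of a 200 BPM beat
    hi = int(frame_rate * 60 / 40)                     # lag of a 40 BPM beat
    best_lag = lo + int(np.argmax(ac[lo:hi]))
    return 60.0 * frame_rate / best_lag

# Click track at 120 BPM: 20 ms bursts of a 1 kHz tone every 0.5 seconds.
sr = 44100
t = np.arange(sr * 5) / sr
clicks = np.sin(2 * np.pi * 1000 * t) * ((t % 0.5) < 0.02)
print(round(naive_tempo_bpm(clicks, sr)))  # ~120
```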

The present invention will now be described, by way of example only, and with reference to the accompanying drawings, in which:

FIG. 1 is a schematic representation of a system for processing an audio signal,

FIG. 2 is a flow chart of a method of processing an audio signal, and

FIG. 3 is a schematic representation of storing metadata with associated time data.

FIG. 1 shows a schematic representation of a system 100 for processing an audio signal. The system 100 consists of a processor (CPU) 102 connected to memory (ROM) 104 and memory (RAM) 106 via a data-bus 108. Computer code or software 110 on a carrier 112 may be loaded into the RAM 106 (or alternatively provided in the ROM 104), the code causing the processor 102 to perform instructions embodying the processing method. Additionally, the processor 102 is connected to a store 114, to output devices 116, 118, and to an input device 122. A user interface (UI) 120 is also provided.

The system 100 may be embodied as a conventional home personal computer (PC) with the output device 116 taking the form of a computer monitor or display. The store 114 may be a remote database available over a network connection. Alternatively, if the system 100 is embodied in a home network, the output devices 116, 118 may be distributed around the home and comprise, for example, a wall-mounted flat panel display, computer-controlled home lighting units, and/or audio speakers. The connections between the processor 102 and the output devices 116, 118 may be wireless (for example, communications via the radio standards Wi-Fi or Bluetooth) and/or wired (for example, communications via the wired standards Ethernet or USB).

The system 100 receives an input of an audio signal (such as a music track from a CD) from which musical features are extracted. In this embodiment, the audio signal is provided via an internal input device 122 of the PC, such as a CD/DVD or hard disc drive. Alternatively, the audio signal may be received via a connection to a networked home entertainment system (Hi-Fi, home cinema, etc.). Those skilled in the art will realise that the exact hardware/software configuration and mechanism of provision of an audio signal is not important; what matters is that such signals are made available to the system 100.

The extraction of musical features from an audio signal is described in the paper "Querying large collections of music for similarity" (Matt Welsh et al., UC Berkeley Technical Report UCB/CSD-00-1096, November 1999). The paper describes how features such as an average tempo, volume, noise, and tonal transitions can be determined by analysing an input audio signal. A method for determining the musical key of an audio signal is described in U.S. Pat. No. 5,038,658.

The input device 122 is for receiving the audio signal, and the processor 102 is for extracting features from the audio signal and for translating the extracted features into metadata, the metadata comprising an instruction set of a markup language. The processor 102 receives the audio signal and extracts musical features such as volume, tempo, and key as described in the aforementioned references. Once the processor 102 has extracted the musical features from the audio signal, the processor 102 translates those musical features into metadata. This metadata will be in the form of very broad expressions such as <SUMMER> or <DREAMY POND>. The translation engine within the processor 102 either operates a defined series of algorithms to generate the metadata or takes the form of a "neural network" arrangement that produces the metadata from the extracted features. The resulting metadata is in the form of an instruction set of a markup language.
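
A translation engine of the first, algorithmic kind might look like the following sketch. The thresholds, and the mapping onto terms such as <SUMMER> and <DREAMY POND>, are invented for illustration and are not taken from the patent:

```python
def translate_to_terms(tempo_bpm: float, key: str, mean_volume: float) -> list:
    # Map broad musical features onto broad markup language expressions.
    terms = []
    if tempo_bpm >= 110 and mean_volume >= 0.5:
        terms.append("<SUMMER>")
    elif tempo_bpm < 80:
        terms.append("<AUTUMN>")
    if "minor" in key.lower() and mean_volume < 0.3:
        terms.append("<DREAMY POND>")
    return terms or ["<EVENING>"]  # fall back to a neutral term

print(translate_to_terms(128, "A major", 0.7))  # ['<SUMMER>']
print(translate_to_terms(65, "A minor", 0.2))   # ['<AUTUMN>', '<DREAMY POND>']
```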

The system 100 further comprises a browser 124 (shown schematically in FIG. 2) that is distributed amongst a set of devices, the browser 124 being arranged to receive the instruction set of the markup language and to receive markup language assets and to control the set of devices accordingly. The set of devices that are being controlled by the browser 124 may include the output devices 116 and 118, and/or may include further devices remote from the system. Together these devices make up an ambient environment system, the various output devices 116, 118 being compliant with a markup language and instruction set designed to deliver real-world experiences.

An example of such a language is physical markup language (PML), described in the Applicant's co-pending applications referred to above. PML includes a means to author, communicate and render experiences to an end user, so that the end user experiences a certain level of immersion within a real physical space. For example, PML-enabled consumer devices such as an audio system and a lighting system can receive instructions from a host network device (which instructions may be embedded within a DVD video stream, for example) that cause the lights or sound output from the devices to be modified. Hence a dark scene in a movie causes the lights in the consumer's home to darken appropriately.

PML is in general a high-level descriptive markup language, which may be realised in XML with descriptors that relate to real-world events, for example <FOREST>. Hence, PML enables devices around the home to augment an experience for a consumer in a standardised fashion.
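
Since PML may be realised in XML, a timed instruction set could be serialised along the following lines. The element and attribute names here are hypothetical; the actual PML schema is defined in the co-pending applications:

```python
import xml.etree.ElementTree as ET

def instruction_set_to_xml(timed_terms) -> str:
    # timed_terms: iterable of (term, start_s, duration_s) tuples.
    root = ET.Element("pml")
    for term, start, duration in timed_terms:
        descriptor = ET.SubElement(root, term.strip("<>"))
        descriptor.set("start", str(start))
        descriptor.set("duration", str(duration))
    return ET.tostring(root, encoding="unicode")

print(instruction_set_to_xml([("<SUMMER>", 0, 120), ("<AUTUMN>", 120, 90)]))
# <pml><SUMMER start="0" duration="120" /><AUTUMN start="120" duration="90" /></pml>
```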

Therefore the browser 124 receives the instruction set, which may include, for example, <SUMMER> and <EVENING>. The browser also receives markup language assets 126, there being at least one asset for each member of the instruction set. So for <SUMMER> there may be a video file containing a still image and also a file containing a colour definition. For <EVENING> there may similarly be files containing data for colour, still image and/or moving video. As the original music is played (or replayed), the browser 124 renders the associated markup language assets 126, so that the colours and images are rendered by each device, according to the capability of each device in the set.
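
The browser's dispatch of assets to heterogeneous devices can be sketched as follows. The asset table, the device interface and the capability names are assumptions made for illustration:

```python
# Hypothetical asset table: at least one asset per markup language term.
ASSETS = {
    "<SUMMER>":  [{"type": "image",  "file": "summer_still.png"},
                  {"type": "colour", "value": "#FFD34D"}],
    "<EVENING>": [{"type": "colour", "value": "#2B1B4E"}],
}

class Device:
    # Minimal stand-in for a markup-language-compliant output device.
    def __init__(self, name: str, capabilities: set):
        self.name, self.capabilities = name, capabilities

    def render(self, asset: dict) -> None:
        print(f"{self.name}: rendering {asset}")

def dispatch(term: str, devices: list) -> None:
    # Each device renders only the assets it is capable of showing.
    for asset in ASSETS.get(term, []):
        for device in devices:
            if asset["type"] in device.capabilities:
                device.render(asset)

dispatch("<SUMMER>", [Device("wall display", {"image"}),
                      Device("lighting unit", {"colour"})])
```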

FIG. 2 summarises the method of processing the audio signal, which comprises receiving 200 an audio signal, extracting 202 features from the audio signal, and translating 204 the extracted features into metadata, the metadata comprising an instruction set of a markup language. The audio signal is received from a CD via the input device 122 of FIG. 1. The steps of extracting 202 the musical features of the audio signal and translating 204 the features into the appropriate metadata are carried out within the processor 102 of the system of FIG. 1. The output of the feature extraction 202 is a meta-description of the received audio signal. The structure of the meta-description will depend upon the nature of the extraction system being used by the processor 102. A relatively simple extraction system will return a description such as Key: A minor; Mean volume: 8/10; Standard deviation of volume: +/−2. A more complicated system would be able to return extremely detailed information about the audio signal, including changes of the features over time within the piece of music that is being processed.
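
The two kinds of meta-description, and how a time-dependent one feeds the translation step, can be pictured as follows. The field names and the tempo rule are invented for illustration:

```python
# Simple extractor: one global meta-description for the whole track.
simple = {"key": "A minor", "mean_volume": 0.8, "volume_stddev": 0.2}

# Advanced extractor: features per time segment (seconds) of the piece.
advanced = [
    {"start": 0,   "duration": 120, "tempo_bpm": 128},
    {"start": 120, "duration": 90,  "tempo_bpm": 72},
]

def translate_segments(segments: list) -> list:
    # Turn per-segment features into timed markup terms (illustrative rule).
    return [("<SUMMER>" if s["tempo_bpm"] >= 100 else "<AUTUMN>",
             s["start"], s["duration"]) for s in segments]

print(translate_segments(advanced))
# [('<SUMMER>', 0, 120), ('<AUTUMN>', 120, 90)]
```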

The method can further comprise the step 206 of storing the metadata. This is illustrated in FIG. 3. The storing can comprise storing the metadata 302 with associated time data 304. In the situation where an advanced feature extraction system is used at step 202, which returns data that is time dependent, the metadata that is output from the translator can also be time dependent.

For example, there may be a defined change of mood in the piece of music that makes up the audio signal. The translator may represent this with the terms <SUMMER> and <AUTUMN>, with a defined point at which <SUMMER> ends in the music and <AUTUMN> begins. The time data 304 that is stored can define the start time and the duration, relative to the received audio signal, of each markup language term in the instruction set. In the example used in FIG. 3, the term <SUMMER> is shown to have a start time (S) of 0, referring to the time in seconds after the start of the piece of music, and a duration (D) of 120 seconds. The other two terms shown have different start and duration times, as defined by the translator. In FIG. 3, the arrow 306 shows the output from the translator.

The method can further comprise transmitting 208 the instruction set to the browser 124. As discussed in relation to the system of FIG. 1, the browser 124 can also receive (step 210) markup language assets 126. The browser 124 is arranged to render (step 212) the markup language assets 126 in synchronisation with the received audio signal.
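
Synchronised rendering amounts to firing each term's assets at its start time relative to the moment audio playback begins. A minimal sketch, assuming the timed (term, start, duration) tuples described above and using time.sleep as a stand-in for a real media clock:

```python
import time

def render_in_sync(timed_terms: list, render) -> None:
    # Fire render(term) at each term's start time, measured from the
    # moment audio playback begins; assumes terms sorted by start time.
    playback_start = time.monotonic()
    for term, start_s, _duration in timed_terms:
        delay = start_s - (time.monotonic() - playback_start)
        if delay > 0:
            time.sleep(delay)  # a real browser would track the audio clock
        render(term)

# Demo with short durations so the example finishes quickly.
render_in_sync([("<SUMMER>", 0, 2), ("<AUTUMN>", 2, 2)],
               lambda term: print("render", term))
```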

CLAIMS

1. A method of processing an audio signal comprising acts of: receiving an audio signal, extracting musical features from the audio signal, translating the extracted musical features into metadata, the metadata comprising an instruction set of a markup language, transmitting the instruction set to a browser, storing the metadata with associated time data, the time data defining a start time and a duration, relative to the audio signal, of each of a plurality of markup language terms of the instruction set, the time data synchronizing the metadata to the received audio signal, receiving markup language assets, and rendering the markup language assets in synchronization with the received audio signal, the synchronization matching the metadata to the received audio signal.

2. The method according to claim 1, wherein the musical features extracted from the audio signal include one or more of tempo, key and volume.

3. A system for processing an audio signal, comprising: an input device for receiving an audio signal; a processor for extracting musical features from the audio signal and for translating the extracted musical features into metadata, the metadata comprising an instruction set of a markup language; a memory operably coupled to the processor for storing the metadata with time data, the time data defining a start time and a duration, relative to the audio signal, of each of a plurality of markup language terms of the instruction set, the time data enabling synchronizing the metadata to the received audio signal; an output device for outputting the received audio signal; and a browser distributed amongst a set of devices, the browser arranged to receive an instruction set of the markup language and markup language assets and to control the set of devices, thereby rendering the markup language assets in synchronization with the received audio signal.

4. The system according to claim 3, further comprising an output device for outputting the received audio signal.

5. A method of processing an audio signal comprising acts of: receiving an audio signal, extracting musical features from a plurality of portions of the audio signal, translating the extracted musical features from the plurality of portions into corresponding metadata, the metadata comprising an instruction set of a markup language corresponding to real-world descriptions, storing in memory the metadata corresponding to each of the plurality of audio signal portions; storing time data in memory in association with each of a plurality of markup language terms of the instruction set, the time data comprising a start time and a duration relative to a corresponding portion of the audio signal, receiving markup language assets, and rendering the markup language assets as identified by the metadata terms in synchronization with the plurality of corresponding portions of the received audio signal.